Adding multi file prediction with Array Job #1

Open
Thomasbush9 wants to merge 1 commit into KempnerInstitute:main from Thomasbush9:feature_branch
Conversation

@Thomasbush9

No description provided.

@dmbala requested a review from Copilot September 12, 2025 13:14

Copilot AI left a comment


Pull Request Overview

This PR adds support for multi-file protein folding predictions using SLURM array jobs. The enhancement allows batch processing of multiple FASTA files by automatically splitting them into chunks and submitting parallel prediction jobs.

  • Introduces a bash script to split input files into chunks and submit array jobs
  • Adds a SLURM array job script for processing individual file chunks
  • Updates documentation to explain the new multi-file prediction workflow
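
The workflow the bullets describe can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code: the function name `split_and_submit`, the `chunk_` file prefix, and the round-robin `split -n r/N` (GNU coreutils) are all assumptions.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Sketch of the split-and-submit flow: collect FASTA paths, split the
# list round-robin into N chunk files, then submit one array job whose
# tasks each consume one chunk list. All names here are illustrative.
split_and_submit() {
  local input_dir=$1 n=$2 out_dir=$3
  local chunk_dir
  chunk_dir="$out_dir/chunks_$(date +%Y%m%d_%H%M%S)"
  mkdir -p "$chunk_dir"

  # One FASTA path per line, then split into chunk_00, chunk_01, ...
  find "$input_dir" -name '*.fasta' | sort > "$chunk_dir/all_files.txt"
  split -d -n r/"$n" "$chunk_dir/all_files.txt" "$chunk_dir/chunk_"

  # Echoed rather than executed so the sketch runs off-cluster:
  echo "sbatch --array=0-$((n - 1)) single_prediction_array.slrm $chunk_dir"
}
```

On the cluster one would drop the `echo`; each array task would then look up the chunk list matching its `SLURM_ARRAY_TASK_ID`.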

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| split_and_pred.sh | Main orchestration script that splits input files into chunks and submits SLURM array jobs |
| single_prediction_array.slrm | SLURM array job script that processes individual file chunks through the protein folding pipeline |
| README.md | Updated documentation explaining the new multi-file prediction workflow and usage instructions |


Comment on lines +4 to +5
```bash
# Usage: ./split_and_submit.sh INPUT_DIR N OUTPUT_PARENT_DIR
# Example: ./split_and_submit.sh /data/images 5 /data/jobs
```

Copilot AI Sep 12, 2025


The usage comment references 'split_and_submit.sh' but the actual filename is 'split_and_pred.sh'. Additionally, the example uses 'images' directory which is misleading for a protein folding workflow that expects FASTA files.

Suggested change

```diff
-# Usage: ./split_and_submit.sh INPUT_DIR N OUTPUT_PARENT_DIR
-# Example: ./split_and_submit.sh /data/images 5 /data/jobs
+# Usage: ./split_and_pred.sh INPUT_DIR N OUTPUT_PARENT_DIR
+# Example: ./split_and_pred.sh /data/fasta_files 5 /data/jobs
```

Comment on lines +11 to +17
```bash
#SBATCH --mail-user=thomasbush52@gmail.com
# Use array-aware log names to avoid clobbering:
#SBATCH --output=/n/home06/tbush/job_logs/%x.%A_%a.out


set -euo pipefail
```

Copilot AI Sep 12, 2025


Hardcoded email address should not be committed to the repository. This should be removed or made configurable to prevent other users from receiving unintended notifications.

Suggested change

```diff
-#SBATCH --mail-user=thomasbush52@gmail.com
+#SBATCH --mail-user=${MAIL_USER}
 # Use array-aware log names to avoid clobbering:
 #SBATCH --output=/n/home06/tbush/job_logs/%x.%A_%a.out
+# Set MAIL_USER environment variable to your email before submitting, e.g.:
+# export MAIL_USER=your.email@domain.com

 set -euo pipefail
+# Ensure MAIL_USER is set
+: "${MAIL_USER:?MAIL_USER environment variable not set. Please set it to your email address before submitting.}"
```
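
One caveat worth noting when applying this suggestion: `sbatch` reads `#SBATCH` lines literally and does not expand shell variables in them, so `--mail-user=${MAIL_USER}` inside the script would not take effect as written. A hedged alternative is to pass the address on the command line at submission time, since command-line options override in-script directives. The placeholder address and the `--mail-type=END,FAIL` choice below are illustrative, and the `echo` keeps the sketch runnable off-cluster:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Placeholder default so the sketch runs anywhere; export your own address.
MAIL_USER="${MAIL_USER:-you@example.com}"

# Fail fast if MAIL_USER somehow ended up empty.
: "${MAIL_USER:?set MAIL_USER to your email address before submitting}"

# Command-line options take precedence over #SBATCH directives, so no
# email needs to be hardcoded in the job script. Drop the echo on the cluster.
echo sbatch --mail-user="$MAIL_USER" --mail-type=END,FAIL single_prediction_array.slrm
```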

Comment on lines +71 to +72
````markdown
```{bash}
source slrm_scripts/multi_pred.sh INPUT_DIR N OUT_DIR
````

Copilot AI Sep 12, 2025


The documentation references 'slrm_scripts/multi_pred.sh' but the actual script is named 'split_and_pred.sh' and located in the current directory, not in a 'slrm_scripts' subdirectory.

Suggested change

````diff
-```{bash}
-source slrm_scripts/multi_pred.sh INPUT_DIR N OUT_DIR
+```bash
+source ./split_and_pred.sh INPUT_DIR N OUT_DIR
````

- Divide the input dir files into n sets, generate .txt containing the path to each .fasta (one per set)
- create an out_dir/chunks_timestamp/ directory where the predictions will be stored

- start N jobs launching the script: slrm_scripts/single_prediction.slrm n times (you can modify the resource of each job by modifying this script)

Copilot AI Sep 12, 2025


The documentation references 'slrm_scripts/single_prediction.slrm' but the actual script is named 'single_prediction_array.slrm' and located in the current directory.

Suggested change

```diff
-- start N jobs launching the script: slrm_scripts/single_prediction.slrm n times (you can modify the resource of each job by modifying this script)
+- start N jobs launching the script: ./single_prediction_array.slrm n times (you can modify the resource of each job by modifying this script)
```
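
For context on what each of those N array tasks would do, the consumer side of the workflow might look roughly like this. The function name `run_chunk`, the `chunk_%02d` naming, and the `echo` standing in for the real prediction command are all illustrative assumptions, not the PR's actual script:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Each SLURM array task selects the chunk list matching its index
# (SLURM_ARRAY_TASK_ID) and processes every FASTA path it contains.
run_chunk() {
  local chunk_dir=$1
  local task_id=${SLURM_ARRAY_TASK_ID:-0}
  local chunk_file
  chunk_file=$(printf '%s/chunk_%02d' "$chunk_dir" "$task_id")

  while IFS= read -r fasta; do
    echo "predicting $fasta"   # stand-in for the real prediction call
  done < "$chunk_file"
}
```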
